Reducer Capacity. An important parameter to be considered in MapReduce 
algorithms is the “reducer capacity.” A reducer is an application of the reduce 
function to a single key and its associated list of values. The reducer capacity is 
an upper bound on the sum of the sizes of the values that are assigned to the 
reducer. For example, we may choose the reducer capacity to be the size of the 
main memory of the processors on which the reducers run. We assume that all 
the reducers have an identical capacity, denoted by q. 

Motivation and Examples. We demonstrate a new aspect of the reducer 
capacity in the scope of several special cases. One useful special case is where 
an output depends on exactly two inputs. We present two examples where each 
output depends on exactly two inputs and define two problems that are based 
on these examples. 

Similarity-join. Similarity-join is used to find the similarity between any two 
inputs, e.g., Web pages or documents. A set of m inputs (e.g., Web pages)
WP = {wp_1, wp_2, ..., wp_m}, a similarity function sim(x, y), and a similarity
threshold t are given, and each pair of inputs (wp_x, wp_y) corresponds to one
output such that sim(wp_x, wp_y) > t. When the similarity measure is
sufficiently complex that shortcuts like locality-sensitive hashing are not
available, every two inputs (Web pages) of the given input set WP must be
compared.
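
The all-pairs comparison described above can be sketched as follows. This is a minimal single-machine illustration, not a MapReduce implementation; the Jaccard function, the sample pages, and the threshold value are assumptions chosen only for the example.

```python
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    """Hypothetical similarity function: Jaccard similarity of two token sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def similarity_join(inputs: dict, sim, t: float) -> list:
    """Compare every pair of inputs and keep the pairs whose similarity exceeds t."""
    return [
        (a, b)
        for a, b in combinations(sorted(inputs), 2)
        if sim(inputs[a], inputs[b]) > t
    ]

# Illustrative inputs (assumed): each Web page is reduced to a set of tokens.
pages = {
    "wp_1": {"map", "reduce", "join"},
    "wp_2": {"map", "reduce", "skew"},
    "wp_3": {"graph", "friends"},
}
result = similarity_join(pages, jaccard, 0.4)
```

With m inputs the sketch performs m(m-1)/2 comparisons, which is why each pair of inputs must meet at some reducer in the distributed setting.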

Skew join of two relations X(A, B) and Y(B, C). The join of relations X(A, B) 
and Y(B, C), where the joining attribute is B, provides the output tuples (a, b, c), 
where (a, b) is in X and (b, c) is in Y. One or both of the relations X and Y 
may have a large number of tuples with the same B-value. A value of the joining 
attribute B that occurs many times is known as a heavy hitter. In skew join of 
X(A, B) and Y(B, C), all the tuples of both the relations with the same heavy 
hitter should appear together to provide the output tuples. 
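
As a rough illustration of the join and of heavy hitters, the sketch below counts B-values and joins two small relations. The `threshold` parameter is an assumption: the text only says a heavy hitter "occurs many times" and does not fix a cutoff. The sample tuples are also made up for the example.

```python
from collections import Counter, defaultdict

def heavy_hitters(x_tuples, y_tuples, threshold):
    """Return the B-values occurring at least `threshold` times across X and Y.

    `threshold` is a hypothetical cutoff, not a value given in the text.
    """
    counts = Counter(b for _, b in x_tuples)
    counts.update(b for b, _ in y_tuples)
    return {b for b, n in counts.items() if n >= threshold}

def join(x_tuples, y_tuples):
    """Join X(A, B) and Y(B, C) on B, producing (a, b, c) tuples."""
    y_by_b = defaultdict(list)
    for b, c in y_tuples:
        y_by_b[b].append(c)
    return [(a, b, c) for a, b in x_tuples for c in y_by_b.get(b, [])]

# Illustrative relations (assumed): b1 is shared by many tuples, b2 by few.
X = [("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a4", "b2")]
Y = [("b1", "c1"), ("b1", "c2"), ("b2", "c3")]
hh = heavy_hitters(X, Y, threshold=4)
out = join(X, Y)
```

For the heavy hitter b1, every X-tuple pairs with every Y-tuple, so the work for a single B-value grows as the product of the two tuple counts.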

Problem Statement. We define two problems where exactly two inputs are 
required for computing an output, as follows: (i) All-to-All problem. In the 
all-to-all (A2A) problem, a set of inputs is given, and each pair of inputs 
corresponds to one output. Computing common friends on a social networking 
site and similarity join are examples. (ii) X-to-Y problem. In the X-to-Y (X2Y) 
problem, two disjoint sets X and Y are given, and each pair of elements
(x_i, y_j), where x_i ∈ X and y_j ∈ Y for all i, j, corresponds to one output.
Skew join and outer product or tensor product are examples. 

The communication cost, i.e., the total amount of data transmitted from the 
map phase to the reduce phase, is a significant factor in the performance of 
a MapReduce algorithm. The communication cost, however, comes with a tradeoff 
in the degree of parallelism. Higher parallelism requires more reducers (hence, 
reducers of smaller capacity), and hence a larger communication cost (because 
copies of the given inputs must be assigned to more reducers). A substantial 
level of parallelism can still be achieved with fewer reducers, which yields 
a smaller communication cost. Thus, we focus on minimizing the total number of 
reducers for a given reducer capacity q, since a smaller number of reducers 
results in a smaller communication cost. 

Tradeoffs. The following tradeoffs appear in MapReduce algorithms and in 
particular in our setting: (i) a tradeoff between the reducer capacity and the total 
number of reducers, (ii) a tradeoff between the reducer capacity and parallelism, 
and (iii) a tradeoff between the reducer capacity and the communication cost. 
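
These tradeoffs can be made concrete with a small calculation. The sketch below assumes m inputs of unit size and a simple grouping scheme that is not described in this section: inputs are split into groups of q // 2, and each pair of groups is sent to its own reducer. The function name and the scheme are illustrative assumptions only.

```python
from math import ceil

def grouping_scheme_cost(m: int, q: int) -> tuple:
    """Return (number of reducers, communication cost) for m unit-size inputs.

    Assumed scheme: split the inputs into groups of q // 2 and assign each
    pair of groups to one reducer, so every pair of inputs shares a reducer
    and no reducer receives more than q inputs.
    """
    if q < 2:
        raise ValueError("q must be at least 2 so that a pair fits in one reducer")
    if m <= q:
        return 1, m  # all inputs fit in a single reducer
    groups = ceil(m / (q // 2))
    reducers = groups * (groups - 1) // 2
    # Each input belongs to one group, which is paired with (groups - 1) others.
    communication = m * (groups - 1)
    return reducers, communication

# Smaller capacity -> more reducers (more parallelism) -> more communication.
table = {q: grouping_scheme_cost(12, q) for q in (12, 6, 4, 2)}
```

Under these assumptions, shrinking q increases both the number of reducers and the number of copies of each input, which matches the three tradeoffs listed above.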

Mapping Schema. A mapping schema is an assignment of the set of inputs to 
some given reducers under the following two constraints: (i) a reducer is assigned 
inputs whose sum of the sizes is less than or equal to the reducer capacity, and 
(ii) for each output, we must assign the corresponding inputs to at least one 
reducer in common. The following two problems are proved to be NP-complete: 

The A2A Mapping Schema Problem. An instance of the A2A mapping 
schema problem consists of a set of m inputs whose input size set is W = 
{w_1, w_2, ..., w_m} and a set of z reducers of capacity q. A solution to the A2A 
mapping schema problem assigns every pair of inputs to at least one reducer in 
common, without exceeding q at any reducer. 
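
Checking a proposed solution is straightforward even though finding one is hard. The following sketch verifies both mapping schema constraints for the A2A case; the function name, the data layout (a dict of sizes and a list of sets), and the sample instance are assumptions made for illustration.

```python
from itertools import combinations

def is_valid_a2a_schema(sizes: dict, reducers: list, q: int) -> bool:
    """Check an A2A mapping schema.

    sizes:    input id -> input size
    reducers: list of sets of input ids, one set per reducer
    q:        reducer capacity
    """
    # Constraint (i): no reducer exceeds the capacity q.
    for assigned in reducers:
        if sum(sizes[i] for i in assigned) > q:
            return False
    # Constraint (ii): every pair of inputs shares at least one reducer.
    for a, b in combinations(sizes, 2):
        if not any(a in r and b in r for r in reducers):
            return False
    return True

# Illustrative instance (assumed): three inputs, capacity q = 5.
sizes = {"i1": 2, "i2": 2, "i3": 3}
good = [{"i1", "i2"}, {"i1", "i3"}, {"i2", "i3"}]
bad = [{"i1", "i2"}, {"i1", "i3"}]  # the pair (i2, i3) never meets
```

Here `good` passes with q = 5, while `bad` fails constraint (ii) and `good` fails constraint (i) if q is lowered to 4.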

The X2Y Mapping Schema Problem. An instance of the X2Y mapping 
schema problem consists of two disjoint sets X and Y and a set of z reducers of 
capacity q. The inputs of the set X are of sizes w_1, w_2, ..., w_m, and the inputs 
of the set Y are of sizes w'_1, w'_2, ..., w'_n. A solution to the X2Y mapping schema 
problem assigns every two inputs, the first from one set, X, and the second from 
the other set, Y, to at least one reducer in common, without exceeding q at any 
reducer. 
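
A corresponding check for the X2Y case differs only in which pairs must meet: one input from X and one from Y. As before, the function name, data layout, and sample instance are illustrative assumptions.

```python
def is_valid_x2y_schema(x_sizes: dict, y_sizes: dict, reducers: list, q: int) -> bool:
    """Check an X2Y mapping schema.

    x_sizes, y_sizes: input id -> size for the disjoint sets X and Y
    reducers:         list of sets of input ids, one set per reducer
    q:                reducer capacity
    """
    sizes = {**x_sizes, **y_sizes}  # safe because X and Y are disjoint
    # Constraint (i): no reducer exceeds the capacity q.
    for assigned in reducers:
        if sum(sizes[i] for i in assigned) > q:
            return False
    # Constraint (ii): every (x, y) pair shares at least one reducer.
    for x in x_sizes:
        for y in y_sizes:
            if not any(x in r and y in r for r in reducers):
                return False
    return True

# Illustrative instance (assumed): capacity q = 4.
x_sizes = {"x1": 1, "x2": 1}
y_sizes = {"y1": 2, "y2": 2}
valid = [{"x1", "x2", "y1"}, {"x1", "x2", "y2"}]
invalid = [{"x1", "y1"}, {"x2", "y2"}]  # (x1, y2) and (x2, y1) never meet
```

Note that two inputs from the same set, such as y1 and y2, need not share a reducer, which is what distinguishes X2Y from A2A.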



